53 research outputs found

    Computational Approaches to Address the Next-Generation Sequencing Era

    In this thesis, I propose new algorithms and models to address biological problems. Computer science plays a key role in proteomics and genetics research due to the advent of big datasets. In the context of protein study, I developed new methods for protein function prediction based on information retrieval principles. Using heterogeneous sources of knowledge, such as graph search and sequence similarity, I designed a tool called INGA that can be used to annotate entire genomes. It was benchmarked during the Critical Assessment of Functional Annotation (CAFA) challenge and proved to be one of the most effective approaches for function inference. To better characterize proteins from a structural point of view, I proposed a strategy for detecting protein conformers based on residue interaction network (RIN) data. RIN graphs were extended to deal with time-dependent fluctuations of protein coordinates and were generated by clustering algorithms. An implementation called RING MD effectively highlighted the key amino acids known to be functionally relevant in ubiquitin. These amino acids are important for explaining the three-dimensional dynamics of the protein. With the same rationale, RIN graphs were also used to predict the impact of mutations within a protein structure. By combining information about a mutant node in the network and its features, an artificial neural network was trained to estimate the change in Gibbs free energy of a protein. Extreme changes in the internal energy might lead to protein unfolding, and possibly to disease. A reduction in protein flexibility may hamper function as well. For example, the extreme fluctuations observed in intrinsically disordered proteins (IDPs) are fundamental to their activities. To better understand IDPs, I contributed to the collection of the largest dataset of disordered regions. The subsequent analysis showed the typical functions of these sequences and the biological processes in which they are involved. Given the importance of detecting them, a comprehensive assessment of disorder predictors was performed to identify the state-of-the-art methods and their limitations. In the context of genetics, I focused on phenotype prediction. During the Critical Assessment of Genome Interpretation (CAGI), I proposed new approaches for the analysis of exome data to prioritize the risk of Crohn's disease and abnormal cholesterol levels. These are often defined as complex diseases, since the mechanism behind their onset is still unknown. In my study, human samples enriched in mutations in critical genes were predicted to have a high genetic risk. In addition to disease-associated genes, protein interaction networks were considered to better account for the accumulation of variants in biological pathways. According to the CAGI organizers, this strategy was among the best approaches. In the simpler case of Mendelian traits, I designed BOOGIE, a method for predicting human blood groups from exome data. It uses a specialized version of the nearest neighbor algorithm to match the gene variants in an unannotated exome with those available in a reference knowledge base. The most similar hit is used to transfer the blood group. With an accuracy above 90%, BOOGIE is a proof of concept that shows the potential applications of genetic prediction, and it can be easily extended to any Mendelian trait. To summarize, this thesis is a partial answer to the exponential growth of available sequences that need further experiments. By integrating heterogeneous information and designing new predictive models based on machine learning, I developed novel tools for biological data analysis and classification. All implementations are freely available to the community and might be helpful in future investigations such as drug design and disease studies.
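
    The nearest neighbor matching described for BOOGIE can be illustrated with a minimal sketch. The abstract does not say how similarity is computed, so the Jaccard score used here is an assumption, and the knowledge base, labels, and variant identifiers are hypothetical placeholders rather than real blood group alleles or the thesis implementation.

```python
from typing import Dict, FrozenSet, Tuple


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity between two variant sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def predict_phenotype(
    sample_variants: FrozenSet[str],
    reference: Dict[str, FrozenSet[str]],
) -> Tuple[str, float]:
    """Return the label of the most similar reference entry and its score."""
    if not reference:
        raise ValueError("reference knowledge base is empty")
    best_label = max(
        reference, key=lambda label: jaccard(sample_variants, reference[label])
    )
    return best_label, jaccard(sample_variants, reference[best_label])


# Hypothetical knowledge base: phenotype label -> characteristic variant set.
# The identifiers are placeholders chosen only for illustration.
REFERENCE = {
    "group_A": frozenset({"var1", "var2"}),
    "group_B": frozenset({"var1", "var3"}),
    "group_O": frozenset({"var4"}),
}

# Variants observed in an unannotated sample (also placeholders).
label, score = predict_phenotype(frozenset({"var1", "var3", "var5"}), REFERENCE)
print(label, round(score, 2))
```

    The label of the closest reference entry is transferred to the sample, which mirrors the "most similar hit" step in the abstract. A real tool would likely also need to handle zygosity and missing calls, which this sketch ignores.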

    Exploring Subgroup Performance In End-to-End Speech Models

    End-to-End Spoken Language Understanding models are generally evaluated according to their overall accuracy, or separately on (a priori defined) data subgroups of interest. We propose a technique for analyzing model performance at the subgroup level, which considers all subgroups that can be defined via a given set of metadata and are above a specified minimum size. The metadata can represent user characteristics, recording conditions, and speech targets. Our technique is based on advances in model bias analysis, enabling efficient exploration of resulting subgroups. A fine-grained analysis reveals how model performance varies across subgroups, identifying modeling issues or bias towards specific subgroups. We compare the subgroup-level performance of models based on wav2vec 2.0 and HuBERT on the Fluent Speech Commands dataset. The experimental results illustrate how subgroup-level analysis reveals a finer and more complete picture of performance changes when models are replaced, automatically identifying the subgroups that most benefit or fail to benefit from the change.
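
    The idea of scoring every metadata-defined subgroup above a minimum size can be sketched as follows. This is a brute-force illustration and not the authors' method: the abstract points to more efficient exploration techniques, and the attribute names (`gender`, `noise`), the `correct` field, and the toy records are all assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, object]
Subgroup = Tuple[Tuple[str, object], ...]


def subgroup_divergence(
    records: List[Record],
    attributes: Sequence[str],
    min_size: int,
    max_order: int = 2,
) -> Dict[Subgroup, float]:
    """Accuracy of each subgroup minus overall accuracy.

    A subgroup is a combination of attribute=value pairs; only subgroups
    with at least `min_size` records are reported.
    """
    overall = sum(r["correct"] for r in records) / len(records)
    results: Dict[Subgroup, float] = {}
    for order in range(1, max_order + 1):
        for attrs in combinations(attributes, order):
            groups: Dict[Subgroup, List[bool]] = {}
            for r in records:
                key = tuple((a, r[a]) for a in attrs)
                groups.setdefault(key, []).append(r["correct"])
            for key, outcomes in groups.items():
                if len(outcomes) >= min_size:
                    results[key] = sum(outcomes) / len(outcomes) - overall
    return results


# Toy per-utterance results with hypothetical metadata.
records = [
    {"gender": "F", "noise": "low", "correct": True},
    {"gender": "F", "noise": "low", "correct": True},
    {"gender": "F", "noise": "high", "correct": False},
    {"gender": "M", "noise": "low", "correct": True},
    {"gender": "M", "noise": "high", "correct": False},
    {"gender": "M", "noise": "high", "correct": False},
]

divergence = subgroup_divergence(records, ["gender", "noise"], min_size=2)
for subgroup, delta in sorted(divergence.items(), key=lambda kv: kv[1]):
    print(subgroup, round(delta, 3))
```

    Running the same function on the predictions of two models and subtracting the per-subgroup values would give one simple way to see which subgroups gain or lose when a model is replaced.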

    An expanded evaluation of protein function prediction methods shows an improvement in accuracy

    Background: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. Results: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. Conclusions: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent. Keywords: Protein function prediction, Disease gene prioritization

    La renovación de la palabra en el bicentenario de la Argentina : los colores de la mirada lingüística

    The book gathers papers reporting research results presented by researchers from Argentina, Chile, Brazil, Spain, Italy and Germany at the XII Congress of the Sociedad Argentina de Lingüística (SAL), Bicentenario: la renovación de la palabra, held in Mendoza, Argentina, from 6 to 9 April 2010. The topics addressed in the 167 chapters show the major lines of research being pursued mainly in Argentina, but also in the other countries mentioned above, and point to emerging areas that have little tradition in Argentina and should be encouraged. The papers published here fall within the following disciplines and/or research fields: Phonology, Syntax, Semantics and Pragmatics, Cognitive Linguistics, Discourse Analysis, Psycholinguistics, Language Acquisition, Sociolinguistics and Dialectology, Language Teaching, Applied Linguistics, Computational Linguistics, History of Language and Linguistics, Indigenous Languages, Philosophy of Language, Lexicology and Terminology.

    A Study on the Writer Identification Task for Paleographic Document Analysis

    The subject of paleography is the study of ancient documents. In particular, the paleographer's aim is to place a document in a cultural environment and chronological interval in the past. Automatic writer identification is therefore a desirable tool for paleographers, as it provides useful information about the document at hand. However, paleographers are often interested in methods that can be easily interpreted by humans. In this paper, we apply some state-of-the-art techniques devised for modern documents to the paleographic domain. Moreover, we propose new techniques and document representations with the aim of producing more understandable representations of a writing style. Experiments were performed on a large dataset of paleographic images; the results demonstrate the feasibility of the proposed approach and the suitability of this tool for supporting the paleographer's work.

    INGA: protein function prediction combining interaction networks, domain assignments and sequence similarity

    Identifying protein functions can be useful for numerous applications in biology. However, the prediction of gene ontology (GO) functional terms from sequence remains a challenging task, as shown by the recent CAFA experiments. Here we present INGA, a web server developed to predict protein function from a combination of three orthogonal approaches. Sequence similarity and domain architecture searches are combined with protein-protein interaction network data to derive consensus predictions for GO terms using functional enrichment. The INGA server can be queried both programmatically through RESTful services and through a web interface designed for usability. The latter provides output supporting the GO term predictions with the annotating sequences. INGA is validated on the CAFA-1 data set and was recently shown to perform consistently well in the CAFA-2 blind test. The INGA web server is available from URL: http://protein.bio.unipd.it/inga